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Abstract 

We construct a parametrization of deep-inelastic structure functions which retains information 
on experimental errors and correlations, and which does not introduce any theoretical bias while 
interpolating between existing data points. We generate a Monte Carlo sample of pseudo-data 
configurations and we train an ensemble of neural networks on them. This effectively provides 
us with a probability measure in the space of structure functions, within the whole kinematic 
region where data are available. This measure can then be used to determine the value of the 
structure function, its error, point-to-point correlations and generally the value and uncertainty 
of any function of the structure function itself. We apply this technique to the determination 
of the structure function F2 of the proton and deuteron, and a precision determination of the 
isotriplet combination F2[p-d]. We discuss in detail these results, check their stability and 
accuracy, and make them available in various formats for applications. 
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1 Determining Structure Functions 



Recently, a considerable amount of theoretical and experimental effort has been invested in 
the accurate determination of the parton distributions of the nucleon, and in particular their 
associate errors, in view of the accurate computation of collider processes and determination 
of QCD parameters The determination of parton distributions requires unfolding the 

experimentally measurable structure functions in terms of their parton content, by resolution of 
the QCD evolution equations. However, for many applications, it is desirable and sufficient to 
deal directly with the experimentally measurable structure functions themselves. For instance, 
the nonsinglet structure function F2 coincides with the (DIS-scheme) quark distribution, and 
thus in particular the strong coupling can be extracted directly from its scaling violations 0. 
Also, knowledge of the structure function F2 is needed for data analysis: for instance, when 
extracting polarized structure functions from cross-section asymmetries 0. 

For all these applications, one needs to know the structure function F2 as a function of x and 
Q^, and not only for the particular values where it has been measured. Therefore, even though 
the laborious task of disentangling the parton content of the target and solving the evolution 
equations is not necessary if one directly determines the structure function, the central problem 
is the same as in the determination of parton distributions. Namely, one has to determine a 
function, and associate error, from a finite set of measurements. Because a function is an 
infinite-dimensional object, this is an a priori ill-posed problem, unless one makes some extra 
assumptions. The simplest way of implementing suitable extra assumption is to assume that 
the given function has a certain functional form, parametrized by a finite number of parameters, 
which can then be fitted from the data. However, the choice of a specific functional form is in 
general quite restrictive: therefore, it is thus a source of bias, i.e. systematic error which is very 
difficult to control. Furthermore, when fitting a fixed functional form to the data, it is very 
hard to obtain a determination not only of the best-fit parameters, but also of their errors (and 
correlations): systematic uncertainties are difficult to take into account p|, ^, and conventional 
error propagation and determination of the covariance matrix in a fixed parameter space is 
unwieldy and may fail because of linearization problems. 

A way out of these difficulties, based on the direct determination of the measure in the 
space of parton distributions by Monte Carlo sampling has been proposed 0, Q . However, it is 
computationally quite intensive, and it has not yet led to a full set of parton distributions with 
errors. Here, we propose a new approach to the determination of structure functions which also 
aims at the determination of the measure in the space of structure functions. Our approach is 
based on the use of neural networks as basic interpolating tools. The pecuhar feature of neural 
networks which we will exploit is that they can provide an interpolation without requiring any 
assumption (other than continuity) on the functional form of the underlying law fulfilled by 
the given observable. Hence, given a sampling of the measure in the space of function over a 
finite set of points, the neural networks can provide an unbiased interpolation which gives us 
the measure for all points, at least within a range of x and where the sampling provided by 
the data is fine enough. 

We will use this procedure to construct parametrizations of the structure function F2 for the 
proton, the deuteron, and an independent precision determination of the nonsinglet combination 
F2 — F2. We will analyze in detail the properties of our results, and show that they reproduce 
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correctly the information contained in the data, while interpolating smoothly between data 
points. We will also see that when the separation of data points in x and is smaller than a 
certain correlation length, the neural net manages to combine the corresponding experimental 
information, thus leading to a determination of the structure function which is more accurate 
than that of individual experiments, and we will devise statistical tests to ascertain that this 
is achieved in an unbiased way. 

In Sect. 2 we will introduce the problem of constructing a parametrization of structure 
functions, the data that we will use, discuss previous solutions of this problem, and moti- 
vate our alternative approach. In Sect. 3 we will then provide some general background on 
neural networks, and in Sect. 4 we will describe our method for the construction of a neural 
parametrization of structure functions. Finally, in Sect. 5 we will turn to a discussion and 
assessment of our results. First we will discuss the theoretically simpler nonsinglet structure 
function: we will examine the compatibility of different experiments, and the ability of the 
neural networks to combine different pieces of experimental information. Then, we will tackle 
the determination of the individual proton and deuteron structure functions, and discuss the 
accuracy of our results. 

2 Parametrization of Structure functions 

Our aim in this paper is to construct a simultaneous parametrization of the structure functions 
F2 for the proton and deuteron: we start from a set of measurements of a structure function, 
say F2, for a discrete set of values of x and Q^, and we wish to determine F2{x, Q^) as a 
function defined in a certain range of x and Q^. We choose the structure function F2 as deter- 
mined in neutral-current DIS of charged leptons, since an especially wide set of experimental 
determinations of it is available, and also because it is of direct theoretical interest. Studies of 
charged-current, neutrino, and polarized structure functions will be left to future work. 

A simultaneous determination of the structure functions for proton and deuteron is very 
useful for applications, since it amounts to a simultaneous determination of the isospin singlet 
and triplet components of the structure function. Also, such a simultaneous determination 
poses peculiar theoretical problems, given that we aim at a determination of the full set of 
correlations. 

In fact, we will provide both a determination of the correlated pair of proton and deuteron 
structure functions, as well as a separate determination of the isotriplet component. This is 
both useful and interesting because, as we shall see, data on the isotriplet combination turn 
out to be affected by much larger percentage errors, i. e. to have a much less favourable signal- 
to-noise ratio: the isotriplet is generally a small number obtained as the difference of two 
numbers which are by an order of magnitude larger. Therefore, on the one hand a more precise 
determination can be obtained by a dedicated analysis, rather than just taking a difference, 
on the other hand this determination will turn out to have peculiar features which deserve a 
specific study. 

Parametrizations of structure functions have been constructed before, and used for the 
applications discussed in the introduction. In this section, we will first describe the specific 
data set that we will use, and then briefly review existing approaches to the parametrization 
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of structure functions and their shortcomings. 



2.1 Experimental data 

We will use the data of the New Muon Collaboration (NMC) ||| and the BCDMS (Bologna- 
CERN-Dubna-Munich-Saclay) Collaboration [^] on the structure function F2 of proton and 
deuteron. These data provide a simultaneous determination of the proton and deuteron struc- 
ture functions in the same kinematic region, and provide the full set of correlated experimental 
systematics for these measurements. Earlier data from SLAC are not competitive with these 
in terms of accuracy and kinematic coverage. The more recent HERA data are available in 
a much wider kinematic region, but only for proton targets. Another set of joint proton and 



deuteron measurements was performed by the E665 Collaboration |T0[. These data are mostly 
concentrated at low (and low x): 80% of the data have < 1 GeV^. They are therefore 
somewhat less interesting for perturbative QCD applications, and we will leave them out of 
the present analysis. We have checked that their inclusion would have a negligible impact on 
the determination of the structure function in the region covered by NMC and BCDMS. The 
kinematic coverage of the data which we include in our analysis is displayed in Figure |I[ A 
better coverage of the large x region at moderate might be achievable in the near future by 



combining some of the earlier SLAC data [11 1 with forthcoming JLAB data [12 



2.1.1 NMC 

The NMC data consist of four data sets for the proton and the deuteron structure functions 
corresponding to beam energies of 90 GeV (72 p and 72 d data points), 120 GeV (64 + 64 data 
points), 200 GeV (74 + 74 data points), and 280 GeV (78 + 78 data points). They cover the 
kinematic range 0.002 < x < 0.60 and 0.5 GeV^ < < 75 GeV^. The systematic errors, given 
in Ref. [H, are: 

• uncertainty on the incoming and outgoing beam energies, fully correlated between proton 
and deuteron data, but independent for data taken at different beam energies {E, E'); 

• radiative corrections, fully correlated between all energies, but independent for proton 
and deuteron (RC); 

• acceptance (AC) and reconstruction efficiency (RE) , fully correlated for all data sets; 

• normalization uncertainty, correlated between the proton and the deuteron data, but 
independent for data taken at different beam energies ((Jat). 

In this experiment, correlation between proton and deuteron data are due to the fact that both 
targets were exposed to the beam simultaneously. 

The uncertainties due to acceptance range from 0.1 to 2.5% and reach at most 5% at large 
X and Q^. The uncertainty due to radiative corrections is highest at small x and large 
and is at most 2%. The uncertainty due to reconstruction efficiency is estimated to be 4% at 
most. The uncertainties due to the incoming and the scattered muon energies contribute to 
the systematic error by at most 2.5%. The normalization uncertainty is 2%. 



3 



10"= 



10' 



10^ 



"T" 



"T" 



> NMC 
□ BCDMS 



□ □□ 
□ □ □□ 

□ □ □ □□ 

□ □ □ □ □ □ 



■¥ 



4> 



+ 4^ * ^ * ^ 



-f + + + 

-4- * * + + 

^ * * =^ ^ 



^ # + 



10" 



10" 



10^ 



Figure 1: NMC and BCDMS kinematic range. 



2.1.2 BCDMS 

The BCDMS data consist of four data sets for the proton structure function, corresponding 
to beam energies of 100 GeV (97 data points), 120 GeV (99 data points), 200 GeV (79 data 
points) and 280 GeV (76 data points), and three data sets for the deuteron structure function 
corresponding to beam energies of 120 GeV (99 data points), 200 GeV (79 data points) and 
280 GeV (76 data points). They cover the kinematic range 0.06 < x < 0.80 and 7GeV^ < 
Q2 < 280 GeV^. 

The systematic errors are: [] 

• cahbration of the incoming muon (beam) energy (fb)', 

• cahbration of the outgoing muon energy (spectrometer magnetic field) (/s); 

• spectrometer resolution (/,.); 

• absolute normalization uncertainty (ctat^); 

• relative normalization uncertainties ((Tn , Cn ). 



^Note that the fuU set of systematics is hsted in the preprint version but not in the pubhshed version of 



Ref. 
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All these sources of systematics are fully correlated for all targets and for all beam energies, 
despite the fact that measurements with different targets were done at different times, for the 



following reasons [|I3|: the calibration of the incoming beam energy E was dominated by a 
systematic uncertainty which was more stable in time than the precision of measurements; the 
calibration of the outgoing muon energy was reproducible to high relative accuracy throughout 
the experiment, independent of the target or beam energy; the resolution of the spectrom- 
eter depended on the constituent material, which did not change during the running of the 
experiment. 

The uncertainty due to beam energy is smaller than 5 %, and those due to magnetic field 
and outgoing muon energy are smaller than 10 %; all uncertainties are most relevant at large 
X and small Q^. The absolute cross-section normalization error is 3%; relative normalization 
errors are 2% between data taken with different targets, 1% between data taken at different 
beam energies for the proton, 1% between data taken at 120 GeV and 200 GeV and 1.5% 
between data taken at 200 GeV and 280 GeV for the deuteron. 

2.1.3 Correlation and covariance matrices 

The structure of the correlation matrix, expressed in terms of the sources of systematics dis- 
cussed above is as follows. For NMC 

F F- 



Oi,totOj,tot 

X {Ei Ej + E[ E'j + ACi ACj + Rd RCj + RE^ REj + a%) (1) 

where atot = \/o's'^ + F{a^ + cr|^); ag is the statistical error, cXc is the combination of correlated 
systematic errors, is the normalization error, and systematic and normalization errors are 
given in percentage. 
For BCDMS 

p.p. 

Pij = ' fh,j + fs,i fs,j + fr,i fr,j + ^^n) (2) 



with (Tat = y cr%^ + cr%^ + cr%^, where ctat^ is a global normalization error, is the relative 
normalization between different targets and aN,, is the relative normalization between different 
beam energies. 

The covariance matrix is defined as 

covij = pijaiaj. (3) 

In the sequel, we will also be interested in the covariance matrix of data for the nonsinglet 
structure function F'^'^ = F'^ — F'^. In general the covariance of two functions of 
related to the covariance of these variables by 

cov(/(f), ^(f)) = ^ 1^^^°^*^^*' ■ 
id 

For the nonsinglet structure function we get 

cov(i^^ - Ft, Ff - Ff) = cov(Ff , Ff ) + cov(i^^ F^) - cov(Ff , F^) - cov{Ff, Ff) . (5) 
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2.2 Structure function fits 



Several structure function fits have been presented in the hterature 0, |T0], They are all 



based on the idea of assuming a more or less simple functional from for the structure function 
F2 (say), and then fitting its parameters. Once the free parameters of a given functional form 
are determined by fitting to the data, one can compute any function of F2, such as e.g. its 
Mellin moments. In principle, given the error matrix of the parameters of the fit, one can also 
determine errors on F2 and functions thereof by error propagation. Clearly, however, this is 
both impractical and subject to errors which are difficult to assess, because of the numerous 
linearizations which may be needed in order to determine the error on a physical observable 
(such as a Mellin moment) and, more importantly, because of the bias which is imposed by the 
choice of a specific functional form. 

In order to overcome at least the practical difficulty of determining errors by propagation 
from the covariance matrix, a frequently adopted shortcut consists of giving a band of acceptable 
results, i.e. a pair of 'upper' and 'lower' values of the structure function, for each [x, Q^). This 
error band can be again calculated either by means of the covariance matrix itself 0, [l^, or 
by some other recipe which involves assessing the compatibility of a certain result with the 
data 0. Whereas this may be sufficient in order to estimate the size of the error related to 
uncertainties on structure functions, it is clearly inadequate for a quantitative analysis, since 
there is no way to use these error bands to uniquely determine the error on a given quantity 
which depends on the structure function. For example, suppose one wants to determine the 
error on the integral over x of the structure function: one might be tempted to identify this 
error with the spread of values of the integral computed from the upper and lower curves. This 
is however not correct, because it neglects the correlation between the structure function at 
different values of x, and it can in fact lead to a substantial overestimate of the error. 

These difficulties can be traced to the fact that when fitting a structure function one is 
trying to determine a function, i.e. an infinite-dimensional quantity, from a finite set of mea- 
surements [0. Assuming a functional form of the function projects the problem onto the 
finite-dimensional subspace of functions of the given form, but at the cost of introducing a 
bias when choosing how to perform this projection. A full, unbiased solution of the problem 
requires instead the determination of the probability measure ^[-^2] in the space of functions 
F2{x,Q'^), such that the expectation value of any observable J-' [F2{x,Q'^)] which depends on 
F2 can be found by averaging with this measure: 

^ [F2{x,Q^)] ) = j VF2J' [F2{x,Q^)\ V[F2l (6) 

where the integral is a functional integral over the space of functions F2 0. For example, 
[^2(0:, Q"^)] could be a Mellin moment of -F2(x, Q"^). 

A practical way of determining this probability measure, in the context of determining 
parton distributions, has been suggested in Ref. 0]. The measure is determined by means of 
a Monte Carlo approach coupled to Bayesian inference: one builds a Monte Carlo sample of 
the space of functions, starting from an essentially arbitrary assumption on the form of the 
measure, and then one apphes Bayesian inference in order to update this measure based on the 
available data. 
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Here, we suggest a different strategy to attain the same goal, in the related context of 
fitting structure functions: we use the available data to generate a Monte Carlo sample of the 
functional measure at a discrete set of points, and then use neural networks to interpolate 
between these points, thus generating the desired representation of the probabihty measure 
in the space of functions. After introducing in the next section the general features of neural 
networks, we will give a detailed explanation of our procedure in Sect. 4. 

3 Neural networks 

Artificial neural networks provide unbiased robust universal approximants to incomplete or 
noisy data. Applications of artificial neural networks range from pattern recognition to the 
prediction of financial markets. In particular, artificial neural networks are now a well es- 
tablished technique in high energy physics, where they are used for event reconstruction in 
particle detectors |T^. Here we will give a brief introduction to the specific type of artificial 
neural networks (henceforth simply neural networks) which we shall use for our analysis of 
structure functions. In particular, we will describe multilayer feed-forward neural networks (or 
perceptrons) , and the algorithm used to train them (back-propagation learning). 

3.1 Multilayer neural networks 

An artificial neural network consists of a set of interconnected units (neurons). The state or 
activation of a given i-neuron, ^j, is a real number, determined as a function of the activation of 
the neurons connected to it. Each pair of neurons is connected by a synapsis, characterized 
by a real number (weight). Note that the weights need not be symmetric. The activation 
of each neuron is a function g of the difference between a weighted average of input from other 
neurons and a threshold 6*^: 



The activation function g is in general non-linear. The simplest example of activation 
function g{x) is the step function g{x) = Q{x), which produces binary activation only. However, 
it turns out to be advantageous to use an activation function with two distinct regimes, linear 
and non-linear, such as the sigmoid 



This function approaches the step function at large f3; without loss of generality we will take 
j3 = 1. The sigmoid activation function has a linear response when x ~ 0, and it saturates 
for large positive or negative arguments. If weights and thresholds are such that the sigmoids 
work on the crossover between linear and saturation regimes, the neural network behaves in a 
non-linear way. Thanks to this non-linear behaviour, the neural network is able to reproduce 
nontrivial functions. 




(7) 
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Figure 2: A three-layer feed-forward neural network consisting of input, hidden and output 
layers. 

We will in particular consider Rosenblatt's perceptrons, also known as multilayer feed- 



forward neural networks [16]. These networks are organized in ordered layers whose neurons 
only receive input from a previous layer. For L layers with ni, . . . ,ni units respectively, the 
state of the multilayer neural network is established by the recursive relations 

^ = 9{^Vt^~ef^^ ^ = l.....nu / = 2,...,L, (9) 

where represents the state of the neurons in the l^^ layer, uf^ the weights between units in 

the (/ — 1)*'' and the layers, and Of"* the threshold of the i^^ unit in the layer. The input 
is the vector ^^^^ and the output the vector 

Multilayer feed-forward neural networks can be viewed as functions F from M"^ M"^, 
parametrized by weights, thresholds and activation function. 



• (10) 



For given activation function, the parameters can be tuned in such a way that the neural 
network reproduces any continuous function. The behaviour of a neural network is determined 
by the joint behaviour of all its connections and thresholds, and it can thus be built to be 
redundant, in the sense that modifying, adding or removing a neuron has little impact on 
the final output. Because of these reasons, neural networks can be considered to be robust, 
unbiased universal approximants. 

3.2 Learning process for neural networks 

The usefulness of neural networks is due to the availability of a training algorithm. This 
algorithm allows one to select the values of weights and thresholds such that the neural network 
reproduces a given set of input-output data (or patterns). This procedure is called 'learning' 
since, unlike a standard fitting procedure, there is no need to know in advance the underlying 
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rule which describes the data. Rather, the neural network generalizes the examples used to 
train it. 

The learning algorithm which we shall use is known as supervised training by back-propa- 
gation. Consider a set of input-output patterns (x, z) which we want the neural network to 
learn. The state of the neural network is generically given by 

«f'=9('4"). ft? =E-r'«r '=1,. i=i,...,i. (11) 

Input -output patterns correspond to pairs of states of the first and last layers, which we 
shall denote as 

X = 

o(x) = (^2) 

The goal of the training is to learn from a given set of data patterns, which consist of associations 
of a given input with a desired output. For any given values of the weights and thresholds, 
define the error function 

A=l i=l 

where Hp is the number of data, i.e., the number of input-output patterns. The error Eq. (|T3|) 
is the quadratic deviation between the actual and the desired output of the network, measured 
over the training set. 

Given a fixed activation function g, Eq. (jl^), the error function is a function of the weights 
and thresholds. It can be minimized by looking for the direction of steepest descent in the 
space of weights and thresholds, and modifying the parameters in that direction: 



. (0 9E 



5e\' = -7] 



(14) 



d9^ 



(0 



where 1] fixes the rate of descent, i.e. the 'learning rate'. 

The nontrivial result on which training is based is that it can be shown that the steepest 
descent direction is given by the following recursive expression: 

- 4j , i-i,...,ni, J - i, . . . ,n/_i , 



^'^ij A=l 



= . = i,...,n, (15) 



do, A=i 



j(0 

i 

where the information on the goodness of fit is fed to the last layer through 



^(.)A = )^) [o„(x^) - ,f] (16) 
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and then it is back-propagated to the rest of the network by 



.(/-1)A 
j 



9'(ftr"^)^A;"M;'. (17) 

i=l 



The back-propagation algorithm then consists of the following steps: 

1. initialize all the weights and thresholds randomly, and choose a small value for the learning 
rate 77; 

2. run a pattern x"^ of the training set and store the activations of all the units; 

3. calculate A^^^^"^ Eq. ([T6|) , and then back-propagate the error using Eq. ([T7|) ; 

4. compute the contributions to 6ujfj^ and 66f^ substituting Eq. (|l^) in Eq. 

5. update weights and thresholds. 

The procedure is iterated until a suitable convergence criterion is fulfilled: for instance, if the 
value of the error function Eq. (13) stops decreasing. The back-propagation can be performed 
in "batched" or "on-line" modes. In the former case, weights and thresholds are only updated 
after all patterns have been presented to the network, as indicated in Eq. (|1^). In the latter case, 
the update is performed each time a new pattern is processed. Generally, on-line processing 
leads to faster learning, is less prone to getting trapped into local minima, and it is thus 
preferable if the number of data is reasonably small [i.e. ^ 10^). 

It is clear that back-propagation seeks minima of the error function given in Eq. (|T^), but it 
cannot ensure that the global minimum is found. Several modifications have been proposed to 
improve the algorithm so that local minima are avoided. One of the most successful, simple and 
commonly used variants is the introduction of a 'momentum' term. This means that Eq. (|l^ 
is replaced by 

= -^T-nS + « '^'^SHlast) 
5ef^ = -v-Tjj + a 6ef\\ast) (18) 

where "last" denotes the values of the 6ujfj and 6df^ used in the previous updating of the 
weights and thresholds. The parameter a ('momentum') must be a positive number smaller 
than 1. 

Also, it may be convenient to choose a more general form of the error function, such as 

rip riL 

E[co, e]^ Yl E(^^(^^) - ^fWABio.i^'') - zf) , (19) 

A,B=1 i 

where Vab is a x Hp matrix which can accommodate weights and correlations between data, 
such as, typically, the inverse of the covariance matrix. If the matrix Vab is diagonal, the 
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back-propagation algorithm is essentially unchanged, up to a rescaling of A- Eq. ([TBD by 
the corresponding matrix element Vaa- However, if the matrix Vab is not diagonal [i.e. the 
data are correlated), Eq. (|16D must be substituted by a non-local expression which involves 
the sum over all correlated data. Clearly, in this case batched training is preferable, so that all 
correlated patterns are shown to the net before the weights and thresholds are updated. 



4 Neural structure functions 
4.1 General Strategy 

We now describe in detail the strategy which we shall follow in order to construct a neural 
parametrization of structure functions (see Figure 3). As discussed in Sect. 2.2, the basic 
idea of our approach consists of constructing a representation of the probability measure in 
the space of structure functions. We do this in two steps: first, we generate a sampling of this 
measure based on the available data, and then we interpolate between these points using neural 
networks. The final set of neural networks is the sought-for representation of the probability 
measure. 

The first step consists of generating a Monte Carlo set of 'pseudo-data', i.e. Nrep replicas 
of the original set of N^at data points: 

p{art){k)_ k = l,...,Nrep, I = 1, . . . , N^at, (20) 

where the subscript i is a shorthand for {xi,Qf). The Nrep sets of Ndat points are distributed 
according to an A^^rfaf-dimensional multigaussian distribution about the original points, with 
expectation values equal to the central experimental values, and error and covariance equal to 



the corresponding experimental quantities. Because [|T^, [T3] the distribution of the experimental 
data coincides (for a flat prior) with the probability distribution of the value of the structure 
function F2 at the points where it has been measured, this Monte Carlo set gives a sampling of 
the probability measure at those points. We can then generate arbitrarily many sets of pseudo- 
data: the number of sets N^ep can be chosen to be large enough that all relevant properties of the 
probability distribution of the data (such as errors and correlations) are correctly reproduced 
by the given sample. We will perform a number of tests in order to determine the optimal value 
of Nrep, which will be discussed in Sect. 4.2. 

The second step consists of training Nrep neural networks, one for each set of N^iat pseudo- 
data. The key issue here is the choice of a suitable error function Eq. (|19|). A possibility would 
be to use the simplest unweighted form Eq. (|T^): 



^e^) 9] = J2 (f^ - , (21) 

i=l 



where p'>'^^*'^^^'^ is the output of the neural network for the input values of (x, Q"^) corresponding 
to the i*^ data point. Then, with suitable choices of architecture and training, we can achieve a 
value of the error such that E/Ndat is much smaller than the typical experimental uncertainty 
on each point. In other words, the trained net will then go on top of each point, on the scale of 
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experimental errors. Therefore, it will reproduce the sampling of the probability distribution 
at the measured points, and interpolate between them, thus achieving our goal. In fact, this 
happens even if E/Ndat is not significantly smaller than the experimental uncertainty, provided 
only that for each point i the shifts df^"^ = pj^'^^^'^^''^ — p("-'^*^(^^ g^j-e distributed as uncorrelated 
random variables with zero mean as k runs over replicas. Then, it is easy to see that the 
distribution of i^^*^"'^*-"^'^^ has the same average, variance, and point-to-point correlations as the 
distribution of 

This is however not an optimal solution in that it does not fully exploit the available physical 
information. To understand this, assume the value of the structure function at the same {x, Q"^) 
point is measured by two different experiments. If the measurements are independent, they 
are uncorrelated. However, since they determine the same physical observable, they cannot 
be treated as independent, and they must in fact be combined to give a single determination, 
with an error that it is smaller than either of the two measurements. Now, it is clear that 
if the structure function is measured at two points which are very close in {x, Q"^), then the 
value of the structure function at these points are likewise not independent: even though we 
do not know the functional form of the structure function, obvious QCD arguments imply that 
such a functional form exists, and it is continuous. Hence, by continuity, two neighbouring 
points cannot be treated as independent samplings of a probability distribution, and should 
be combined. This combined distribution has a variance which is in general smaller than its 
sampling at fixed points. 

This reduction in variance is automatically achieved when fitting a given function to the 
data: if we fit a function to each set of pseudo-data, then as the pseudo-data span their 
multigaussian space, the best-fit values of the parameters which parametrize the function vary 
over their own space. As the number of data points is increased, the variance of these parameters 
decreases. For instance, assuming for the sake of argument a multilinear law with k parameters, 
with k data points all parameters can be determined with an error which can be found by error 
propagation, but with N^at > k points this error is reduced by a factor Ij ^/Ndat — k. More in 
general, if the number of data exceeds the number of parameters, the error on the parameters 



will be reduced by a factor which is bounded by the so-called RCF inequality |]T8|; if there 
exists an estimate of the parameters which saturates this bound (as in the multilinear case), 
then it can be found by maximum-likelihood. 

We are thus led to think that the optimal strategy consists of choosing as an error function 
the log-likelihood function for the given data. In this case, the matrix Vab in Eq- ( P^ is 
identified with the experimental covariance matrix, so the error function is 

where covjj is the covariance matrix defined in Eq. (^. In such case, if we assume that after 
training the neural network can capture — at least locally — the underlying functional form, 
then a maximum-likelihood determination of its parameters leads to a distribution of neural 
networks with a smaller uncertainty than that based on the minimization of the naive error 
function Eq. (PT]). 

In practice, however, this choice turns out not to be viable. Indeed, as discussed in Sect. 3.2, 
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the back-propagation algorithm with the error function Eq. (22) is non-local, and has poor 
convergence properties, especially when the number of data is large, as in our case. Note that 
this difficulty is related to the number of correlated data points, and thus it persists even if 
the number of correlated systematics is actually small, despite the fact that in such case the 
inversion of the correlation matrix is not a problem [^. 

A strategy that is both physically effective and numerically viable consists instead of choos- 
ing as an error function the log-likelihood calculated from uncorrelated statistical errors, namely 



E""[^.0] = t^ ^ (23) 



'=1 (^Ls 



where (J^^g^^ is the statistical error on the z*'^ data point. Because the error function is now 
diagonal, use of the back-propagation algorithm is no longer problematic. In order to under- 
stand the effect of minimizing the error function Eq. (|2^), write the probability distribution of 
the i^^ data point as k runs over replicas as 

piartm ^ pie^P) + , + .(J (24) 

p=l 

where pj^*^^^^ is the experimental value, all r^'^^ are distributed in k as univariate gaussian random 
numbers with zero mean, is the statistical error, and are the Ngyg systematics. Whereas 
there are N^at independent rj s, the number of independent variables rj^p for each p is smaller, 
in that there is one single random variable rj^p for all correlated data points. The maximum- 
likelihood fit of the error function Eq. (^31) to a set of data distributed according to Eq. 



determines the values of F^""^*-* which best fit the distribution of 

sys 

p=l 

In this case, the distribution of best-fit functional neural networks as k runs over replicas 
will have the smaller uncertainty which is obtained by maximum-likelihood combination of 
the statistical errors. However, it will still reproduce the (unimproved) systematic errors as k 
runs over the Monte Carlo sample. The distribution of systematic errors will remain unbiased, 
provided only that for each point i the distribution of best-fit values of pj^"'^^^^''^ about pj^"-^^^^''^ as 
k runs over replicas is such that jg uncorrelated to r^'^^. In other words, the best-fit nets 

provides a probability distribution which optimally combines statistical errors, but reproduces 
the systematic errors of the Monte Carlo sample. 

Just as in the first step, also in this second step of the procedure it will be necessary to verify 
whether the number of replicas is large enough that the statistical properties of the distributions 
are properly reproduced. Furthermore, it will be necessary to verify that the network fitting 
procedure did not introduce any bias, and specifically that if the statistical variance is reduced 
this is due to having combined data points, and not to a bias of the fitting procedure. To this 
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purpose, we will develop a number of statistical tools which will be discussed in Sect. 5.1 when 
presenting our results. 

At the end of our procedure, we obtain a set of Nrep trained neural networks which provides 
us with a representation of the probability measure in the space of structure functions, and thus 
it allows us to determine any observable by averaging with this measure, Eq. (P). In particular, 
estimators for expectation values, errors and correlations are 



{net) \ 

k=l 



/ rep N.^- 



N rep 



af"^*) = M (f/"^*))' \ - (rt'^y (27) 

y \ \ J I rep \ / rep 

(net) \ * I rep \ * / rep \ / rep (0<S.\ 

Pij ~ 0^) 0^) ' 

where i, j denote two pairs of values (xj, Q]), {xj, Q'j) (not necessarily in the original data set). 
More in general, any functional average, defined by Eq. (^, can be estimated by 

Nr 



^ [F,{x,Q')] ) = ^ [F2("^*)W(x,g2)] . (29) 



'rep 



4.2 Generation of artificial data 



The Monte Carlo set of pseudo-data can be generated by using equations of the form of 
In particular, for the NMC experiment we have 
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(exp) 



(30) 
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where each r- is distributed as an univariate gaussian random number with zero mean over 



the replica sample [i.e., as k varies), F^^^^'^ is the central experimental value, and the various 
sources of systematics were discussed in Sect. 2.1.1. Since statistical errors are uncorrected, 
there is an independent r^^s for each data point. Correlated systematics are correctly reproduced 
by taking the same and for all data corresponding to all energies and targets, the same 
r4 for all targets, and the same ri, r2 and r-j for all targets. 
Likewise, for BCDMS we have 
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Figure 3: Flow chart for the construction of the parametrization of structure functions 



where now ri,r2,r3 and take the same value for all data (all energies and targets), re is fixed 
for a given target, ry is fixed for given beam energy. 

In order to determine a suitable value of Nrep, we now compare expectation values, variance 
and correlations of the pseudo-data set with the corresponding input experimental values, as 
a function of the number of replicas. A graphical comparison is shown in Figure. 4, where we 
display scatter plots of the central values, errors and correlation coefficients for samples of 10, 
100 and 1000 rephcas, for the proton, deuteron, and nonsinglet structure functions. 

A more quantitative comparison can be performed defining the following quantities 

• Average over the number of replicas for each experimental point i {i.e. a pair of values of 
X and Q'^) 



p{art) 



= ^E^^"^^'^- (32) 



rep Nrep ^ 
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Figure 4: (Fr> ) vs. (af "^ ) vs. and (p,^^^ ) vs. p,^^^ with iV,,^ = 10 
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(red), 100 (green) and 1000 (blue) replicas. 



Variance of {Pj^"'^^^) 
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(art) 



J/frt^'A") -(Ff-))' . (33) 
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Correlation and covariance of two points 



p{art) p{art)\ _/^{«^*)\ 



(art) \ " ^ I rep \ * / rep\ / rep l'iA\ 



covS^*) = fi;''^at'^a^''^. (35) 

These quantities provide the estimators of the experimental values, errors, and correlation 
which one extracts from the sample of artificial data. 
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Table 1: Comparison between experimental and generated artificial data for the proton struc- 
ture function. The experimental data have (a^'^^^'''^ = 0.0116, l^p^^^P^\ = 0.382 and 
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(36) 
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We can similarly define {V[a'^'"'^^] ) , ( PEfa^'^''*)] ) , ( V'ip^"'*)] ) and ( V^[cov('^''*)] 

\ / dat \ / dat \ / dat \ / dat 

These estimators indicate how close the averages over generated data are to the experi- 
mental values. 
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Table 2: Comparison between experimental and generated artificial data for the deuteron 



structure function. The experimental data have ( a 
= 3.3 X 10-5. 
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where the scatter variances are defined as 
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Similarly we define r[cov*^'"'*'']. The scatter correlation indicates 

the size of the spread of data around a straight line in the scatter plots of Figure ^. 
Specifically, r[(T*^"'^*^] = 1 implies that {o'l"''^^^) is proportional to a^^^^^^ (and similarly for 
gi^Q^)^ Note that the scatter correlation and scatter variance are not related to 
the variance and correlation Eqs. ([33| - |35D or averages thereof. 
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Table 3: Comparison between experimental and generated artificial data for the nonsinglet 



structure function. The experimental data have ( a 
= 8.4 X 10-6. 
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where a^^^^"* is the experimental error on the i^^ data point, an so forth. These quantities 
are interesting because even if the scatter correlation r Eq. (^) is very close to one 
(so all points in the scatter plots Figure 4 lie on a straight line), there could still be a 



systematic bias in the estimators Eqs. (p2| - p5D . For example, if for all points i the variance 
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The values of these quantities for samples of 10, 100 and 1000 replicas and for the three 
structure functions under consideration are shown in Tables 1-3. From these tables, it can 
be seen that the scaling of the various quantities with N^ep follows the standard behaviour of 



Gaussian Monte Carlo samples |T^. For instance, the variance on central values should scale as 
'^/Nrep, while the variance on the errors should scale as 1/ ^JN^^. This implies that a smaller 
sample is sufficient in order to achieve a given accuracy on means than it is required to reach 
the same accuracy on errors. Also, because = [1 - {p'^^^'^^ff l^rev, the estimated 

correlation fluctuates more for small values of p^^^P\ and thus the average correlation {p^"'^*^)dat 
tends to be larger than the corresponding experimental value {p^^^^^)dat- Notice finally that a 
larger sample is necessary in order to achieve a given percentage error on the nonsinglet than 
on the proton and deuteron. This is a consequence of the fact that the signal-to-noise ratio is 
smaller for the nonsinglet: the proton and deuteron structure functions are roughly of the same 
size, so the absolute value of the nonsinglet structure function is typically by about a factor 10 
smaller. 

Inspection of the tables shows that in order to reach average scatter correlations of 99% and 
percentage accuracies of a few percent on structure functions, errors and correlations a sample 
of about 1000 replicas is necessary. Notice in particular that with 1000 replicas the estimated 
correlation fiuctuates about the true value, rather than systematically overshooting it. 



4.3 Building and training Neural Networks 

The choice of an optimal structure of the neural nets and of an optimal learning strategy 
cannot follow fixed rules and must be tailored to the specific problem. Here we summarize and 
motivate the specific choices which we made for our fits. 

Number of layers 

The optimal number of layers in a neural network depends on the complexity of the specific 
task it should perform. It can be proved that any continuous function, no matter how 
complex, can be represented by a multilayer neural network with no more than three layers 
(input layer, a middle hidden layer, and output layer). However, in practice, using a single 
hidden layer may require a very large number of units in it. Thus, it can be more useful to have 
two hidden layers and a smaller number of neurons on each. We will only consider networks 
with two hidden layers. 

Number of hidden units 

The number of hidden units needed to approximate a given function JF is related to the num- 
ber of terms which are needed in an expansion of JF over the basis of functions g{x), g{g{x)), . . .. 
There exist several techniques to determine the optimal number of units. Here we have taken 
a pragmatic approach. First, we carry out all the learning with a small number of units. Then, 
we restart the whole learning adding new units one by one, until stability of the error function 
is reached. In this way, an optimal architecture is found, which is large enough to reproduce 
faithfully the training patterns, but small enough that training is fast. The final results which 
we will present are obtained with the architecture (4-5-3-1). The reason why there are four 
input nodes will be discussed below. 

Activation function 

We have taken a sigmoid activation function Eq. (^. The sensitivity of the neural network 
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can be enhanced by substituting the sigmoid activation function in the last layer with the 
identity (linear response). This avoids saturation of the last layer neurons and leaves space 
for more sensitive responses. For this reason, we have adopted linear response in the last 
layer. With these choices, and the architecture discussed above, the behaviour of the network 
is determined by the values of 47 free parameters (38 weights and 9 thresholds). 
Resettling 

If the activation C,i of input (output) node i is numerically large, in order for the activation 
function to be in the nonlinear response regime, the weights Uij must be very small. However, 
if the activation is very large, the shifts Eq. (^) in the first stages of the learning process are 
also very large, thus leading to an erratic behaviour of the training process, unless the learning 
rate rj Eq. (p!5| ) is initially taken to be very small and then changed along the training. In order 
to avoid this, it is convenient to rescale both the input and the output data in such a way that 
that typical activations are ~ 0{1) for all i. In practice, we have rescaled all values of {x, Q^) 
and the structure function F2 in such a way that they span the range 0.1 — 0.9. This ensures 
that activation functions are in the nonlinear regime when the absolute values of all the weights 
settle to values close to one. 

Input 

We have used as input variables x, Q^, logx and logQ^. The choice of taking logx and 
log along with x and is motivated by the expectation that these are the variables upon 
which F2 naturally depends. Note that this is not a source of theoretical bias: if this expectation 
turns out to be incorrect and these variables are useless, after training the neural network will 
simply disregard them. If, however, the expectation is correct this choice will speed up the 
training. We have checked that neural networks trained just with x and perform as well as 
the ones we use but need longer training times. 

Theoretical assumptions 

The only theoretical assumption on the shape of ^2(0;, Q^) produced by the neural nets is 
that it satisfies the kinematic bound ^2(1,(5^) = for all Q^. In general, a constraint can be 
enforced by adding it to the error function Eq. (0), so that configurations that violate the 
constraint are unfavourably weighted. In our case, the constraint is local in x and so its 
implementation is straightfoward: it can be enforced by including in the data set a number 
of artificial data points which satisfy the constraint, with suitably tuned error. More general 
cosntraints, such as the momentum sum rule in a parton fit, would require a term which makes 
the error function nonlocal in x and Q^, and would thus make the training computationally 
more intensive. 

We have thus added to the data set 10 artificial points at a; = 1 with equally spaced values of 
on a linear scale between 20 and 200 GeV^ (which ensures that the bound is respected even 
when extrapolating beyond the data region). The choice of the error on these points is very 
delicate: if it is too small, the neural networks will spend a significant fraction of their training 
time in learning these points. In such case, the kinematic constraint F2{x = 1, Q^) = is 
enforced with great accuracy, but at the expense of the fit to experimental data. We have thus 
taken the error on these points at x = 1 to be of the same order of the smallest experimental 
error, namely, 10~^ for and F2, and -\/2 x 10~^ for F2 — F2. 
Training patterns 

Since the number of data is reasonably small, we have adopted on-line training (see Sect. 3.2): 
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Figure 5: Fit of the nonsinglet structure function F2—F2 to a subset of BCDMS data points for 
increasing training lengths: insufficient training (left); j normal training (middle); overlearning 
(right). The variable in abscissa is an arbitrary point number. 



weights are updated after each pattern is shown to the net. If the experimental data were shown 
to the net in a fixed order, a bias could arise, e.g. because of the regularity of some input pat- 
terns. Therefore, we have shown the data to the network using each time a random permutation 
of their order. Furthermore, the data are always shown at least 10^ times, in order to allow 
compensations of different variations. Because we use a pseudo-random number generator to 
compute permutations with periodicity 6 x 10^ which is smaller than a typical number of train- 
ing cycles (see Sect. 5), there might still be periodic oscillations in behaviour of the best-fit 
network. However, we have checked that these oscillations do not significantly affect the fit. 
Learning parameters 

The choice of the learning rate 77 Eq. (|T^) is particularly important: large values of rj lead 
to an unstable training process, while small values lead to slow training which may get trapped 
in local minima. In practice, it is convenient to vary the value of r] during the training: first, 
one looks for the region of the global minimum with a larger value of rj. Once this region is 
located, the learning rate is reduced so that the probability of jumping away is small and it is 
then easier to deepen into the minimum. The momentum term a Eq. (|18D is closely connected 
to the learning rate: an increase of a implies an increase of the effective learning rate. The 
optimum value of a depends on the updating procedure used. 

For our training, we have chosen to first minimize the error function Eq. (|2T|) , and then 
switch to the minimization of Eq. (|23|): first we look for the rough location of the minimum, 
and then we refine its search. The first set of training (of the order of 10^ cycles) is performed 
with rj ~ 10~^, and the second (of the order of 10^ cycles) with r] ~ 10~^ (see Sect. 5). We will 
always take a = 0.9. 

Convergence of the training procedure and parametrization bias 

As the training of the neural network progresses, the neural net reproduces in a more and 
more detailed way the features of the given data set. It follows that the problem of determining 
convergence of the training procedure is tangled with the problem of avoiding a possible fitting 
bias. Indeed, if the net is too small or the training is too short, it will artificially smooth 
out physical fiuctuations in the data, whereas if the training is too long, the net will also fit 
statistical noise ('overlearning'). Convergence must be therefore assessed on the basis of a 
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criterion which guarantees that the trained nets faithfully reproduce the local functional form 
of the data, without fitting statistical fluctuations, but also without imposing a smoothing 
bias. This issue is illustrated in Figure ^ where we show the form of a neural network with the 
architecture and features discussed above, trained to a small, arbitrarily chosen subset of 30 
BCDMS data points for the structure function F2 — -F^, with increasingly long training cycles. 
With this small data set, chosen for the sake of illustration, the architecture of the nets is very 
redundant, and overlearning is easily achieved. The first plot (very short training) shows that 
the neural network strongly correlates data points, but it does not quite reproduce the behavior 
of data. The second plot corresponds to a longer training, which leads to a more or less ideal 
fit. In the last plot (very long training) the neural network follows the statistical fluctuations 
of individual data points (overlearning). Note that the abscissa in this plot is an arbitrary 
point number; jumps in the best-fit function are not significant and simply reflect the fact that 
points of neighbouring number might correspond to rather different values of (x, Q^). 

Extreme overlearning as displayed in Figure ^ can be avoided by using a number of parame- 
ters in the neural network which, while still being largely redundant so as to avoid parametriza- 
tion bias, is significantly smaller than the number of points to be fitted. However, an accurate 
assessment of whether convergence has been reached can only be obtained by means of statisti- 
cal indicators of the goodness of fit, tailored to the particular problem. These will be discussed 
in the next section. By means of such indicators, one may determine an optimal training length, 
and then subject all networks to training of this fixed length. This is more convenient than 
stopping the training on the basis of a convergence criterion, e.g. on the value of the error 
function, because this latter choice would tend to artificially damp statistical fluctuations of 
the pseudo-data sample, and it could thus lead to biased results. 

5 Results 

We will present results for the proton, deuteron and nonsinglet structure function F2. We have 
decided to train a separate neural network for the nonsinglet in view of precision applications. 
This is because the proton and deuteron structure functions are roughly of the same size, while 
their difference is typically smaller by a factor 10. Hence, in order to achieve a 1% accuracy on 
F2 — F2 we would need approximately a 0.1% accuracy on F2 and F2. An individual fit of the 
proton, deuteron, and nonsinglet allows us to achieve a comparable accuracy on all structure 
functions. After discussing the general aspects of the assessment of the quality of the fits, we 
will discuss separately the nonsinglet, proton and deuteron fits. 

5.1 Fit assessment 

As discussed in Sect. 4.1, the determination of the probability density in the space of structure 
function is based on the training of a neural network on each of the replicas of the original data 
set. We must therefore first, assess whether the training of each network is sufficient, and then 
verify that the set of networks represents the probability density of the data in a faithful and 
unbiased way. 

The goodness of fit provided by each net is measured by the error function E^^^ Eq. ( PBj ) 
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which is minimized by the training process; the average goodness of fit is thus 
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where we have further normahzed to the number of data so that Eq. (^2]) coincides with the 
mean square error per data point per rephca. We also define the normahzed error function 
computed for the original set of data (rather than one of the replicas) 
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where i^^*^"^*-"^'^-* is the prediction from a neural net trained on the set of central experimental 
values F^^^^\ 

The estimator for the central value of each data point is given by the expectation value 
p(net)\ -g^^ (p6D. The goodness of the fit to the original data provided by this estimator is 
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where covjj is the covariance matrix Eq. (|^), and we have again normalized to the number of 
data points. Note that the number of degrees of freedom is not just the difference of N^at and 
the number of parameters of the neural network, because the neural network is by construction 
redundant (see Sect. 4.3): a network with a smaller number of parameters could lead to fits of 
the same quality, though with a longer training. However, the data points are about 500 (see 
Sect. 2) while the parameters of the net are about 50 (see Sect. 4.3), corresponding to some 



10 or 20 truly independent degrees of freedom: hence, the expression Eq. (|4^) differs only by a 
few percent from the per degree of freedom. 

In order to assess how well experimental errors are reproduced by their neural estimator 
Eq. (|27| ) we define the corresponding average variance and percentage errors 
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We can similarly define percentage errors on the correlation and covariance {V[pi^^i)yj^ and 
/l^[cov("^*)]\ . 

\ / dat 
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A point-by-point measure of the agreement between network estimators and their experi- 
mental counterparts is provided by defining the scatter correlation of central values, in analogy 
to Eq. (H) 

/ p(exp) / p{net)\ \ _ / p{exp)\ / / p(net)\ \ 
\Tp{net)l \ ^ ' rep/ dat \ I dot W I rep/ dat ( 

where the scatter variance of the network is 

-^"-(((^'-")J\,.-(((^""")„,).J- («) 

Similarly we define r[cov^"^*'']. As discussed in Sect. 4.2, the scatter 

correlation — 1 < r < 1 indicates how closely the network estimator follows its experimental 
counterpart as the index i runs over experimental points. Specifically, r[cr'^"'^*)] !^ 1 implies that 
l^^{net)\^ is approximately proportional to af'^^^ (and similarly for r[F*^"'^*^] etc.). The coefficient 
of proportionality can then be found by comparing the average experimental variance Eq. (^1]) 
with the average network variance 

,(-)\ =^Y.-^-\ (49) 

/ dat Ndat ^ 
1=1 

A substantial deviation of the scatter correlation from one indicates that the neural estimator 
is not following the pattern of experimental variations. This might indicate that the neural 
network is incapable of reproducing the shape of the data, or else that it is successfully reducing 
fiuctuations in the data. 

In practice, we are interested in the case in which the neural networks lead to a successful 
fit {i.e. with a good value of the (|4^)), while leading to a reduction of the variance, as 
discussed in Sect. 4.1, so that {cr^"'^^^) dat < {o'^'^^^^)dat- Because we are minimizing the diagonal 
error function Eq. (pSf ) and not the full likelihood, we cannot simply rely on the value of the 
average error (E) and the to assess the quality of the fit. Rather, we must verify that the 
distribution of systematic errors remains unbiased, and that if a reduction of the variance is 
observed it is not due to lack of fiexibility of the neural network. 

In order to test whether systematics are reproduced in an unbiased way, we notice that 
experimental correlations are entirely due to systematic errors, according to Eqs. (^^. It 
follows from these equations and from the definition of covariance matrix (|^) that if the networks 
reduce the statistical variance but reproduce the experimental systematics (as they ought to), 
then o",-""^*^ < a^^^^^ and p,-"*^*^ > P^T'^'^ but cov,-""^*^ ^ cov,-^^^\ So if cov^"^^*) and cov^*^^^^ are 
strongly correlated and have approximately the same mean, the systematics are reproduced in 
an unbiased way. Notice however that this condition is in general sufficient, but not necessary: 
for instance, more measurements at the same point from independent experiments have p^'^^P^ = 
identically, while p*^"^*) = 1, so in this case cov*^^^^^ and cov*^""^*^ cannot possibly be equal. 

In order to test whether the variance reduction is due to the fact that the networks are 
distributed with smaller variance around the true experimental values, we construct a suitable 
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estimator, assuming for simplicity all errors to be Gaussian. Consider a measurement value mj 
of F2, where i represents a pair of values (x, Q^). For each measurement 



rrii =ti + (Ji Si (50) 

where Si is a univariate Gaussian number with zero mean, ti is the true value of F2, and ai its 
error. The k^^ replica of generated data is then 

gf = ni, + rf ^ a,=U + {s, + rf , (51) 

where is also a univariate zero mean Gaussian random number. Assume now that the best- 
fit neural networks are distributed about the true values ti with an error dj. For the k^^ neural 
network we have 

nf^ =U + rf a, , (52) 

where, assuming that the network is an unbiased estimator, r'f^^ have zero mean, and can be 
normalized to be univariate. Because the networks are determined by fitting to g^''\ in general, 
r'f'^ will be correlated to r^'^'' and ti. 

If we assume that the number of replicas is large enough that (r^'^^J = (r'^^j = and 

\ / rep \ I rep 

r^'^^yX = Ur'f'^y^ = 1, while the number of data points is large enough that {si)dat = 

/ rep \ / rep 

and {sf)dat = 1 we get the average error Eq. (^2|) 

{E)rep = 2+ {{(y/af)dat " 2{{rr') rep{a / a)) daU (53) 

where ()dat denotes averaging with respect to data in one replica (index i) and ()rep denotes 
averaging with respect to replicas (index k). Now, define a modified average error, where the 
prediction of the fc-th network instead of being compared to the k-ih replica (as in the average 
error proper Eq. (|4^) ) are compared to the experimental data: 



l\rep ^ 

{E),,p = -^Y^^E^^) (55) 

^rep ^dat 

We get immediately 

{E)rep = 1 + {ia/a)'')dat- (56) 

We finally define the ratio 

n ^ (57) 

/rep 

which is the desired estimator. Indeed, assume that the networks displays significantly smaller 
fluctuations than the data, af"''^*'* « a^'^^^K We wish to test whether this is due to the fact that 
the network has performed substantial error reduction by combining several experimental data 
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points (yfj^*^^ into a determination of its parameters. In such case the correlation between r'f'' 
and rf'' is very weak, since r'^-^'' is determined by the values of several points g'j'^'' with j ^ i. 
Hence 

n^^-±M^. (58) 

If error reduction is substantial and {{(^ /o'^))dat « 1? then TZ ^ ^. Physically, this means that 
by combining many data, the network manages to always be closer to the true values ti than 
the replicas. If instead the network was just artificially smoothing the data, we would expect 

{(Ji/oi) > 1 and again no correlation (r'^^rf) ^ 0, so TZ^ 1. 

\ I rep 

It is instructive to consider also a case in which we observe no error reduction, af'^^^ ~ ^('=^p)_ 
In this case, we can simply check that the neural network are behaving properly by verifying 
that the scatter correlation r\a\ is high. However, the value of H. provides a cross-check. In 
such case, we expect Oi ~ 0",, and a large correlation ^r'\r\)rep ~ 1 since n\ is determined by 
g\. Hence, in this case we would expect TZ ^ 2. If instead the networks were not reproducing 
the data accurately we would again get 7^ ^ 1 as above. 

5.2 Nonsinglet 

The nonsinglet structure function is defined as 

F^'{x, Q') = Q') - F,\x, Q'). (59) 

Data for this quantity are obtained by discarding the BCDMS proton data taken with a 100 GeV 
beam, which have no deuteron counterpart, and then exploiting the fact that all remaining 
BCDMS and NMC data come in proton, deuteron pairs, taken at the same values of x and Q^. 
For NMC data, there are small difference in the binning in between proton and deuteron, 
which are however negligible on the scale of the typical variation of the structure function and 
the size of the experimental errors. 

In order to establish the length of the training, we study the behaviour of the average error 
function E^^^ Eq. ( ^3]) determined from the original set of data as a function of the number of 
training cycles (Figure 6).[|lt is seen that an extremely long training (of order of 10^ training 
cycles) is required in order for the error function to stabilize. However, if we look at the 
average error function computed from data of either of the two experiments, it is apparent that 
it is only the contribution to E^^^ from the BCDMS data which keeps improving, while -E'^^c' 
attains very rapidly a small value which is then stable. If we train the nets only to data of 
one experiment (Figures 7-8), the training required for reaching convergence with BCDMS is 
still rather longer than with NMC data, but substantially shorter than when both experiments 
are fitted at once. The greater difficulty in learning the BCDMS data is due to fact that these 
data have smaller errors. Notice that the net trained on each of the two experiments predicts 
to a large extent the other experiment once convergence has been achieved. This means that. 



Cyclic patterns in the behaviour of E^^^ as a function of the training length are due to the periodicity of the 
pseudo-random number generator which is used to determine the permutation of data in the training process, 
see Sect. 4.3. 
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Figure 6: Nonsinglet: dependence of the error on the length of training. 

even though only one experiment is used for training, the contributions to E^^^ from both 
experiments are reduced in the process — though, of course, the experiment which is not used 
has eventually a worse average E^^\ 

This suggests that an optimized training can be obtained by showing to the network BCDMS 
data more often than NMC data. Then, the BCDMS data arc learnt faster but the information 
from the NMC data is not lost. An optimal combination is found (by trial and error) when the 
BCDMS data are shown 90% of times, and NMC data 10% of times. Then, the average error 
E^^^ for the two experiments both reach convergence at the same value after about 1.8 x 10^ 
training cycles (Figure 9). Note that, after convergence has been reached, the average error 
E^^^ of the two experiment oscillate about each other, while slowly improving at the same rate. 
This training is optimized, in comparison to that of Figure 6 both because it is by more than 
a factor 5 shorter, and, more importantly, because the convergence of the values of E^^^ for the 
two experiments shows that the global minimum has been found. 

Indeed, a substantial difference in values of E^'^^ for the two data sets is not acceptable, 
because it entails that if on average E^'^^ ^ 1, in actual fact the net is overlearning one of 
the two data sets, and underlearning the other, i.e., overlearning in one kinematic region and 
underlearning in another region. This is what happens with ~ 10^ training cycles (origin of the 
plot of Fig. 6) . Unless there are reasons to beheve that errors are not correctly estimated for 
one of the data sets, this means that the neural network is getting trapped in a local minimum 
corresponding to the data set which has been learnt faster, and thus unable to find the global 
minimum. Indeed, when the global minimum is found, the values of E^^^ for the two data sets 
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Figure 7: NS training on NMC data. Figure 8: NS training on BCDMS data. 
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Figure 9: Nonsinglet weighted training (90% BCDMS, 10% NMC) 
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are approximately equal and improve at the same rate, as it happens for the weighted training 
of Fig. 9 with ^ 2 X 10^ training cycles. It is clear from Fig. 6 that if an unweighted training 
is adopted, it might still be possible to achieve comparable values of for the two data sets, 
but only with a very long training, leading to a value of E^^^ « 1. In such case, the net would 
be overlearning throughout the full data region. Hence, weighted training is mandatory if we 
wish to avoid both local minima and overlearning. 

As discussed in Sect. 4.3, it turns out to be convenient to let each network first undergo 
a preliminary short training to the simple error function Eq. (|2l|) with large learning rate rj 
Eq. (pISl). After some experimentation along the lines discussed above, the training parameters 
have been chosen to be as follows: 4.6 x 10^ cycles with learning rate t] = 4 x 10~^ and error 
function Eq. (^, followed by 1.8 x 10* cycles with r/ = 4 x 10"* 90% of the times on BCDMS 
data and 10% of the times on NMC data with error function Eq. (|23|) . The momentum term 
Eq. (P!B| ) is always set to a = 0.9. 

We have thus produced a set of 1000 networks, each trained on a replica of the original data 
set. The prediction for the structure function and its error computed from them according to 
Eqs.(p6|-p7|) are compared to the experimental data in Figure. 10. The features of this set of 
neural networks are summarized in Table 4. 

The table shows that central values are very well reproduced. The value of of the two 
data sets is almost the same, and close enough to ~ 1, so that there is neither overlearning 
nor underlearning. The statistical error is substantially reduced by the network, especially 
for NMC data which have larger experimental error. Correlations correspondingly increase, 
so that the average covariance is essentially unchanged: this indicates that the systematics is 
reproduced in an unbiased way. The expectation that 71 ~ 0.5 is beautifully fulfilled, indicating 
that the variance reduction is due to the fact that the network is learning an underlying law. 

It is interesting to observe that while 

^(net) ig highly correlated to a^^^'P) for BCDMS data, 
this does not happen for NMC. The reason for this can be traced to the fact that local informa- 
tion provided by each NMC data point is very weak: a few NMC points on top of the BCDMS 
data are sufficient to train the neural network and the remaining ones do not provide significant 
extra information. To illustrate this, in Figure 11 we show the average error computed from a 
network trained on all BCDMS data, but only 20 NMC points (7% of NMC points arbitrarily 
chosen among those where the systematics is less than the statistical error): this fit is as good 
as the fit where all NMC data are kept. Note, however, that in this case the value of 
computed from the two data sets cross and rapidly diverge as the number of cycles grows above 
1.5 X 10*^. This divergence is not observed when the full set of NMC data is included (Figure 9): 
this shows that the reduced NMC data set does not carry the full information needed for a 
detailed fit, so the use of the full data set remains necessary if we want to avoid fine-tuning. 



30 



0.003<x<0.030 



CM 



13 CM 



13 CM 



2.00 


+ NMC data 

Neural Networks 










1.75 


— ^ J 








x=0.0045 (+1.75) — 


1.50 










x=0.0080 (+1.50) — 


1.35 


— ♦ * 




g 5 




x=0.0125 (+1.25) — 


1.00 










x=0.0175 ( + 1.00) — 


0.75 


— * ^ « 


1 1 


t ■ s 


$ < 

1 


\ x=0.0250 (+0.76) _L 



3 4 5 6 7 8 910 
0.03<x<0.12 



20 30 



2.00 



1.75 



1.50 



1.25 



1.00 



0.75 



- + NMC data 
□ BCDMS data 
o Neural Networks 



T 



4 i y s s 



x=0.035 (+1.75) 
x=0.050 (+1.60) 

x=0.070 (+1.25) 

x=0.090 (+1.00) 

a ffl x=0.100 (+0.85) 
i x=0.110 (+0.75) 



10 



0.12<x<0.80 



50 100 



2.0 



1.5 



1.0 



0.5 



0.0 



-0.5 



_ <f ♦ * * g * 

_ f * • a • 

x=0.275 (+1.25) # * * * 
^=0,350 (+1,00) ♦ ft ♦ 

x=0.4,50 (+0.75) a * 

j[=0.550 (+0.50) 

x=0.650 (+0.25) 

JC=0.750 

+ NMC data 

□ BCDMS data 

o Neural Networks 



x=0.140 
x=0.180 
! X=0.2Z5 



{+2.00]_ 
( + 1.75)- 
( + 1,50)' 



10 



50 100 



31 

Figure 10: Nonsinglet data and network prediction at the corresponding (x, Q^) value. For 
clarity, the points are offset by the amount given in parenthesis. 
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Table 4: Nonsinglet results 



It is clear from Fig. 10 that the coverage of the data is such that, within the data region, 
the interpolating structure function is very well constrained by the data. In fact, the observed 
substantial error reduction is due to the fact that each new data point modifies very little the 
behaviour predicted by neighbouring data. It is therefore interesting to ask how far one can 
trust the neural networks when extrapolating outside the data region. 

At large x, it can be easily verified that the extrapolation is sufficiently constrained by the 
kinematic bound ^2(1, Q^) — that no increase in the uncertainty is observed when x goes 
outside the region of the data even at = 3 GeV^, where there are no data with x > 0.3. In the 
small X region, instead, the uncertainty increases very rapidly: for instance, at = 30 GcV^ 
the error is o" 0.004 at a; = 0.3 (bulk of the data), a ^ 0.006 at a; = 0.1 (edge of the data 
region) and a ~ 0.017 at a; = 0.01 (extrapolation by one order of magnitude in 1/a;). Similar 
behaviour is observed at other scales: hence, extrapolation at small x outside the data region 
is not predictive. 

Coming now to extrapolation in Q^, it turns out that extrapolation at large is somewhat 
more reliable than extrapolation at low Q^. For instance, at a; = 0.3 the error is o" 0.004 at 
Q2 = 30 GeV^ (bulk of the data). At low it reaches a « 0.006 at = 3 GeV^ (edge of the 
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Figure 11: Nonsinglet training on all BCDMS data and 20 NMC data 

data), and becomes a ^ 0.016 at = 1 GcV^ (extrapolation). At high Q"^ it reaches a ^ 0.004 
at = 100 GeV^ (edge of the data), and becomes a ~ 0.007 at = 300 GeV^ (extrapolation). 
Similar behaviour is observed for other values of x. The fact that the extrapolation is more 
stable at large Q'^ can be understood as a consequence of the fact that, because of asymptotic 
freedom, the large Q'^ scaling violations are smaller and smoother, and thus easier for the 
neural network to capture. Note however that, in contrast to a parton fit, here the theoretical 
information on scaling violations encoded in evolution equations is not being used, and thus 
extrapolation, even at large Q^, should be used with care. 

5.3 Proton and deuteron 

Neural networks for proton and deuteron are trained independently, but on a simultaneous set 
of pseudo-data. Namely, because proton and deuteron experimental data are correlated (recall 
Sect. 2.1), each pseudo-data replica is generated as a set of correlated proton and deuteron data 
points. Two independent proton and deuteron neural networks are then trained on the proton 
and deuteron data of each replica. This guarantees that experimental correlations between 
proton and deuteron will be reproduced by the network sample. 

5.3.1 Proton 

The training of neural networks on proton structure function data is rather more difficult 
than in the nonsinglet case, because these data have a much smaller percentage error. The 
dependence of the average error on the training length is displayed in Figure 12. The noisiness 

33 




2.00 



1.75 



1.50 




1.25 



1.00 



0.75 



solid: E 



(0) 

NMC+BCDMS 



dashes: E^f^c 



dotted: E 



(0) 



BCDMS 



0.50 



I I I I I I I 

200 400 



_i I I I I I I I I I I I 

600 800 1000 



cyclesx 10 



Figure 12: Proton: dependence of the error on the length of training. 



of the training process reflects the high precision of the data, especially for BCDMS. Clearly, 
there are two main difficulties: first, the BCDMS data arc only learnt very slowly, and second, 
the quahty of the fit to NMC data is poor, and does not improve (in fact, it slowly deteriorates) 
as the BCDMS data are learnt. 

Inspection of the fits to data of a single experiment (Figures 13-14) reveals that, similarly to 
the nonsinglet case, NMC data are learnt very fast whereas BCDMS data require a much longer 
training. However, unlike in the nonsinglet case, networks trained on data of one experiment 
are now entirely incapable of predicting data from the other experiment: as one average error 
slowly improves, the other deteriorates dramatically. This STiggests that an optimized training 
can again be obtained by giving more weight to the BCDMS data, however the imbalance 
cannot be too large because we cannot rely on one experiment being able to predict the other. 



An optimal combination is found when BCDMS data are shown 60% of the time and NMC 
data 40% of time (Figure 15). Both experiments then reach convergence after about 2 x 10^ 
training cycles. However, unlike in the nonsinglet case, the average error for NMC remains 
significantly worse than BCDMS. In fact, this feature is already present in fits to a single 
experiment of Figures 13-14: the average error for NMC at convergence, though learnt fast, 
remains rather worse than BCDMS. We found no way of obtaining better fits of NMC data 
even with NMC data only, much longer training, or larger architecture of the network. This 
appears therefore to be a feature of the data, to which we will come back shortly. 

It is interesting to observe that the constraint F2(1,Q^) = is crucial in order to ensure 
proper convergence of these fits. On the other hand, it should be noticed that, because this 
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Figure 13: Proton training with NMC data. Figure 14: Proton training with BCDMS 

data. 
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Figure 15: Proton weighted training (60% BCDMS, 40% NMC) 
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/ dat 
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/ dat 

p(^^P)) 0.38 0.17 0.52 

/ dat 

p("^*A 0.69 0.53 0.80 

/ dat 

r[p("^*)] 0.58 0.15 0.80 



V[cov("^*)]\ 1.2 X 10-9 2.7 X 10-9 1.3 x 10"^° 

. / dat 
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Table 5: Proton results 



constraint is only imposed with an absolute error of 10"'^, the structure function fits in the 
vicinity of x = 1 are only reliable within this accuracy, especially at low where no large 
X data are available. The final training parameters are chosen as follows: 9.6 x 10^ cycles 
with learning rate 77 = 9 x 10"'^ and error function Eq. (|2T|), followed by 2 x 10^ cycles with 
= 4 X 10"^ 60% of the times on BCDMS data and 40% of the times on NMC data and 
error function Eq. (p3D. The momentum term Eq. ( ]I8| ) is always set at a = 0.9. A sample of 
1000 networks trained in this way is used to produce values of the structure function and error 
which are compared to the data in Figure 16. The features of these networks are summarized 
in Table 5. 



36 



0.003<x<0.030 



2.25 



2.00 



1.50 



1.25 — 



1.00 



- + NMC data 

Neural Networks 






- 




x=0.0045 (+1.75) - 






x=0.0080 (+1.50) - 




s 


x=0.0125 (+1.25) - 






x=0.0175 (+1.00) - 


: * ¥ « 


^ s s ^ 


5 ^ x = 0.0250 (+0.75) 



3 4 5 6 7 8 910 

0.03<x<0.12 



20 30 



2.25 



2.00 



1.75 



1.50 



1.25 



1.00 



+ NMC data 
□ BCDMS data 
o Neural Networks 



* * 1 * 



1 S 



5 



10 



x=0.035 (+1.75) 



x=0.050 (+1.50) 



x-0.070 (+1.25) 



x=0.090 (+1.00) 
x=0.100 (+0.85) 
x=0.110 (+0.75) 

50 100 



0.12<x<0.80 



2.5 



2.0 



1.5 



1.0 



0.5 



0.0 



-0.5 



j1=0-140 ( + 2-00)- 
x=0.180 ( + 1.75)- 
S x=0.225 (+1.50)- 



jt=0.275 ( + 1.25) ^ ^ ^ 5 *B^a 
x=0.350 ( + 1.00) * * * ^ 

x=0.450 ( + 0.75) S ? ft 

x=0.550 ( + 0-50) SS! 

x=0.650 (+0.25) m 

x=0.750 

+ NMC data 
a BCDMS data 

o Neural Networks 

5 10 



50 100 



37 

Figure 16: Proton data and network prediction at the corresponding {x, Q"^) value. 
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Figure 17: Blow-up of proton data and pre- Figure 18: Proton NMC data at different 
dictions. beam energies, and associate predictions. 



Table 5 shows that all central values are well reproduced, even though the for NMC is 
rather high. Errors and correlations are very well reproduced for the BCDMS data. There does 
not appear to be any reduction of variance for this experiment: a*^"^*-' ~ g^y^d p(net) ^ p{exp). 

the networks simply reproduce the behaviour of the data. The prediction for the ratio Eq. (|57D 
7?. ~ 2 is again beautifully borne out by the data. The NMC data instead display significant 
variance reduction. However, the covariance is not reduced, indicating that systematics are 
faithfully reproduced, and the value of 71 is in good agreement with the expectation of Eq. (^8]), 
with a = (cr'-"^*^). The weak scatter correlation of variances can be understood analogously to 
the nonsinglet case, as a consequence of the fact that the nets can predict most NMC data 
based on the information provided by a subset of these data. 



We are therefore left with the problem of the large value of for NMC. The difficulty in 
obtaining a satisfactory fit of NMC data was already noted by other groups 0, Q in the context 
of global parton fits. The origin of the problem can be understood by taking a closer look at the 
NMC data, especially in the central kinematic region (middle plot in Figure 16). It is clear that 
a sizable fraction of these data are not compatible with the remaining ones: some data points 
deviate by several standard deviation from other measurements performed at equal or almost 
equal values of (x, Q'^). This may suggested underestimated systematics. A blow-up of the data, 
however (Figure 17), reveals that NMC data not only disagree with BCDMS data with the same 
(x, Q^), but in fact also with other NMC data at neighbouring (x, Q^). Furthermore (Figure 
18) the inconsistency persists even within sets of data taken by NMC at fixed beam energy. 
Therefore, none of the known sources of systematics can cause this discrepancy. Inspection 
reveals that the data which appear to be display the largest disagreement are always at the 
edge of a bin [e.g., the largest or smallest value for given x). The cumulative effect of these 
inconsistencies leads to a very large which cannot be reasonably attributed to a statistical 
fluctuation. 
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Figure 19: Deuteron: dependence of the error on the length of training. 

Whereas we are unable to pinpoint the precise origin of the effect, we conclude that a 
subset of the NMC data are not consistent with remaining information on the proton structure 
function, and that this inconsistency cannot be cured by enlarging any of the experimental 
systematics. The neural network correctly discards these points, and the relatively large value 
of signals this inconsistency in the data. 

A consequence of the relatively poor quality of the fit to NMC proton data is that in this 
case extrapolation at large is as unreliable as extrapolation at low Q^, because the large 
region is dominated by the NMC data. Interpolation, as well as extrapolation at large x in the 
region where low x data are available remain quite accurate. 

5.3.2 Deuteron 

The training of networks on deuteron data does not present specific problems. The dependence 
of the average error on the training length is displayed in Figure 19, and shows reasonable 
fast convergence after about 10* training cycles. Fits to individual data sets are displayed in 
Figures 20-21: BCDMS data are learnt more slowly and have a slightly better (E^^^) than 
NMC data. Training on individual data sets shows that the two data sets cannot be used to 
predict each other, in that the average error of the predicted set is poor, even though it does 
not deteriorate as the training progresses. No weighted training is necessary. 



The final training parameters are chosen as follows: 9.6 x 10^ cycles with learning rate 
rj = 9 X 10~^ and error function Eq. (|2T|), followed by 2 x 9 x 10^ cycles with rj = Ax 10~* and 
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Figure 20: Deuteron training with NMC data. Figure 21: Deuteron training with BCDMS 

data. 



error function Eq. The momentum term Eq. ( ]TB| ) is always set at a = 0.9. A sample of 

1000 networks trained in this way is used to produce values of the structure function and error 
which are compared to the data in Figure 22. The features of these networks are summarized 
in Table 6. 

The results of Table 6 show that central values and errors are well reproduced. There is 
a moderate amount of variance reduction for BCDMS data, and a somewhat larger reduction 
for NMC. The covariance is well reproduced, indicating that systematics are reproduced in a 
faithful way. The features interpolation and extrapolation are similar as in the nonsiglet case: 
interpolation and extrapolation at large x where small x data are available are quite accurate, 
extrapolation at large by a small amount is tolerable, while extrapolation at small or at 
small X are subject to large uncertainties. 
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;ure 22: Deuteron data and network prediction at the corresponding (x, Q^) value. 
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1.10 
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2.64 


2.74 


2.53 


n 


0.96 


0.90 


1.04 


f |'_p("ei)j 


0.998 
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0.999 



9.1 X 10-5 1.4 X 10"^ 3.2 x 10"^ 

/ dat 

PE[a^^^^)]\ 61% 66% 56% 

/ dat 

a^^^P)) 0.010 0.014 0.006 

/ dat 

(t("^*)) 0.006 0.007 0.004 

/ dat 

r[a(""*)] 0/72 0^34 0.94 

(U7 (m an 

/ dat 

p(^^pA 0.31 0.22 0.43 

/ dat 



(net)\j 

r[p("^*)']"^ 0.50 0.27 0.69 



p(""*)) 0.63 0.57 0.71 

I dat 



V[cov("^*)]) 3.4 X 10^1° 1.2 X 10-9 2.1 x 10"^° 

/ dat 

(cov(^^'pA 3.2 X 10-5 4.0 X 10"^ 2.2 x 10-^ 

\ / dat 

(cov(""*M 2.6 X 10-5 3.0 x 10-^ 1.3 x 10-^ 

\ / dat 

r[cov("^*^] (m 0^63 0.90 

Table 6: Deuteron results 

6 Summary 

We have presented a determination of the probability density in the space of structure func- 
tions for the structure function F2 for proton, deuteron and nonsinglet, as determined from 
experimental data of the NMC and BCDMS collaborations. Our results, for each of the three 
structure functions, take the form of a set of 1000 neural nets, each of which gives a determi- 
nation of F2 for given x and Q^. The distribution of these functions is a Monte Carlo sampling 
of the probability density. This Monte Carlo sampling has been obtained by first, producing a 
sampling of the space of data points based on the available experimental information through 
a set of Monte Carlo replicas of the original data, and then, training each neural net to one of 
these replicas. 

In practice, all functions are given by a FORTRAN routine which reproduces a feed-forward 
neural network (described in Section 3) entirely determined by a set of 47 real parameters. Each 
function is then specified by the set of values for these parameters. Our results are available at 
the web page |http : //sophia . ecm . ub . es/f 2neural/ . The full set of FORTRAN routines and 



parameters can be downloaded from this page. On-line plotting and computation facilities for 
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structure functions, errors and point-to-point correlations are also available through this web 
interface. 

Given the Monte Carlo sample of the probability density, any functional of the structure 
function, as well as errors and correlations, can be computed by averaging over the sample 
according to Eq. (^91) . The sample has been carefully tested against the available data to 
produce reliable results wherever data are available. Because the data provide a rather fine 
scanning of the x, plane (Figure 1), this representation of the probability density is expected 
to be very accurate throughout the region of the data. However, care should be adopted in 
extrapolating results far from the data region. On the other hand, we have verified that as one 
gets further from the data region, the spread of the probability density rapidly increases: so, 
errors computed from the Monte Carlo sample should be a good indicator of the point where 
the extrapolation becomes unreliable. 

Our results can be useful for a variety of applications to the precision phenomenology of 
deep-inelastic scattering. A particularly promising application consists of coupling the bias- 
free determination of structure functions discussed here with the bias-free technique for the 
description of QCD scaling violations based on the use of truncated moments [^]. In this 
approach, QCD evolution equations are directly expressed in terms of the scale dependence of 
the contribution to moments of structure functions from the experimentally accessible region. 
A precision determination of the strong coupling based on these techniques will be presented 
in a companion paper 
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In conclusion, this is the first complete determination of a probability density in a space of 
structure functions. As such, it provides an example of a bias-free determination of a function 
with its associate error from a discrete set of experimental data, which is a problem of foremost 
importance in the determination of errors on parton distributions. 
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